A Simple Algorithm for Finding the Maximum Recoverable System State in Optimistic Rollback Recovery Methods

Authors

  • David B. Johnson
  • Peter J. Keleher
  • Willy Zwaenepoel
Abstract

In a distributed system using rollback recovery, information saved on stable storage during failure-free execution allows certain states of each process to be restored after a failure. For example, in a system of deterministic processes using message logging and checkpointing, a process state can be restored only if all messages received by the process since its previous checkpoint have been logged. In a system of nondeterministic processes using checkpointing alone, a process state can be restored only if it has been recorded in a checkpoint. Optimistic rollback recovery methods in general record this information asynchronously, assuming that a suitable recoverable system state can be constructed for use during recovery. A system state is called recoverable if and only if it is consistent and the state of each individual component process can be restored. This paper presents a simple algorithm for finding the maximum recoverable system state at any time in a system using optimistic rollback recovery. We show that in such a system, there is always a unique maximum recoverable system state, extending our previous result for deterministic systems using message logging and checkpointing. These new results can be applied both to deterministic and to nondeterministic systems. We have implemented this algorithm on a collection of SUN workstations running the V-System. The algorithm requires no additional communication in the system, and requires little storage for execution.
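The definition of a recoverable system state given in the abstract (consistent, and every component process state restorable) can be illustrated with a small sketch. The code below is not the algorithm from the paper, which is not reproduced on this page; it is a naive fixpoint search under an assumed model. In that model each process has numbered state intervals, each interval carries a dependency vector naming the highest interval of every other process it depends on, and a set marks which intervals can be restored from stable storage. The names `max_recoverable_state`, `deps`, and `stable` are hypothetical and chosen for illustration only.

```python
from typing import List, Set

def max_recoverable_state(deps: List[List[List[int]]],
                          stable: List[Set[int]]) -> List[int]:
    """Naive fixpoint search for the maximum recoverable system state.

    Assumed model (not taken from the paper):
      deps[i][k][j] is the highest state interval of process j that
                    interval k of process i depends on.
      stable[i]     is the set of intervals of process i that can be restored.
    Interval 0 of every process is assumed restorable and free of
    dependencies on other processes, so a candidate always exists.
    """
    n = len(deps)
    # Start from the most recent restorable interval of each process.
    current = [max(stable[i]) for i in range(n)]
    changed = True
    while changed:
        changed = False
        for i in range(n):
            # Restorable intervals of process i, no later than its current
            # one, that depend only on intervals still included for others.
            candidates = [
                k for k in stable[i]
                if k <= current[i]
                and all(deps[i][k][j] <= current[j]
                        for j in range(n) if j != i)
            ]
            best = max(candidates)
            if best != current[i]:
                current[i] = best  # roll process i back
                changed = True
    return current

# Hypothetical two-process example: interval 2 of process 0 depends on
# interval 2 of process 1, which is not restorable.
example_deps = [
    [[0, 0], [1, 0], [2, 2]],   # process 0, intervals 0..2
    [[0, 0], [0, 1], [1, 2]],   # process 1, intervals 0..2
]
example_stable = [{0, 1, 2}, {0, 1}]
result = max_recoverable_state(example_deps, example_stable)
```

In this example process 0 must fall back from interval 2 to interval 1, because interval 2 depends on a state of process 1 that cannot be restored. The search only ever lowers a process to the latest interval that any recoverable state could still contain, so under the stated assumptions the fixpoint it reaches is the unique maximum the abstract refers to.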

Similar resources

Asynchronous Optimistic Rollback Recovery Using Secure Distributed Time

In an asynchronous distributed computation, processes may fail and restart from saved state. A protocol for optimistic rollback recovery must recover the system when other processes may depend on lost states at failed processes. Previous work has used forms of partial order clocks to track potential causality. Our research addresses two crucial shortcomings: the rollback problem also involves t...

Independent Checkpointing and Concurrent Rollback for Recovery in Distributed Systems - An Optimistic Approach

Checkpointing in a distributed system is essential for recovery to a globally consistent state after failure. In this paper, we propose a solution that benefits from the research in concurrency control, commit protocols, and site recovery algorithms. A number of checkpointing processes, a number of rollback processes, and computations on operational processes can proceed concurrently while tole...

Efficient Transparent Optimistic Rollback Recovery for Distributed Application Programs

Existing rollback-recovery methods using consistent checkpointing may cause high overhead for applications that frequently send output to the “outside world,” since a new consistent checkpoint must be written before the output can be committed, whereas existing methods using optimistic message logging may cause large delays in committing output, since processes may buffer received messages arbi...

A Fast Rollback-Recovery Scheme based on Optimistic Message Logging

This paper presents an efficient rollback recovery scheme based on optimistic message logging. To speed up the recovery process, the rollback point of the failed process is broadcast and other processes asynchronously make the rollback decision based on the vector time. Asynchronous recovery process usually causes two possible problems: One is the message delivered from an invalid state inter...

Completely Asynchronous Optimistic Recovery with Minimal Rollbacks

Consider the problem of transparently recovering an asynchronous distributed computation when one or more processes fail. Basing rollback recovery on optimistic message logging and replay is desirable for several reasons, including not requiring synchronization between processes during failure-free operation. However, previous optimistic rollback recovery protocols either have required synchroni...

Publication date: 1990